Predictive modeling and data analysis in a secure shared system

ABSTRACT

A system and method enables users to selectively expose and optionally monetize their data resources, for example on a web site. Data assets such as datasets and models can be exposed by the proprietor on a public gallery for use by others. Fees may be charged, for example, per new model, or per prediction using a model. Users may selectively expose public datasets or public models while keeping their raw data private.

RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 14/025,063, filed Sep. 12, 2013, which is a non-provisional of U.S. Provisional Application No. 61/710,175 filed Oct. 5, 2012, both of which are incorporated herein in their entirety by this reference.

COPYRIGHT NOTICE

©2012-2013 BigML, Inc. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever. 37 CFR §1.71(d).

TECHNICAL FIELD

This invention pertains to computer-implemented methods and apparatus for machine learning or modeling digital datasets, and utilizing data models to make predictions about new data in a secure, private or shared system.

BACKGROUND OF THE INVENTION

Machine Learning uses a number of statistical methods and techniques to create predictive models for classification, regression, clustering, manifold learning, density estimation and many other tasks. A machine-learned model summarizes the statistical relationships found in raw data and is capable of generalizing them to make predictions for new data points. Machine-learned models have been and are used for an extraordinarily wide variety of problems in science, engineering, banking, finance, marketing, and many other disciplines. Uses are truly limited only by the availability and quality of datasets. Building a model on a large dataset can take a long time. Further, the time and resources necessary to build a model increase as the required quality or depth of the model increases. In view of these investments, some models are valuable to other users. Datasets themselves also may have value in view of the investment to acquire, check or scrub, and store the dataset. In fact, many interesting datasets are built after laborious processes that merge and clean multiple sources of data. Often datasets are based on proprietary or private data that owners do not want to share in their raw format.

SUMMARY OF PREFERRED EMBODIMENTS

The following is a summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not intended to identify key/critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.

In one aspect, a system in accordance with the present disclosure may enable an owner of a dataset to control permissions that would enable another party or second user to use the dataset to make a new model derived from the dataset. The owner or the system may charge the second user a fee to use the dataset. The fee may be shared with the owner.

In another aspect, a system in accordance with the present disclosure may enable an owner of a model to control permissions that would enable another party or second user to use the model to make predictions based on a new input dataset provided by the second user. The owner or the system may charge the second user a fee to use the model. The fee may be shared with the owner of the model. The model may be, for example, a decision tree model.

According to another aspect, a dataset or a model may be displayed or advertised in a public gallery. The display may contain a summary, thumbnail or metadata describing the dataset or the model, as the case may be.

According to yet another aspect, a model may be published to a public gallery in a black-box form that enables use of the model to make predictions without disclosing its internal operation. In another aspect, a model may be published to a public gallery in a white-box form that enables use of the model to make predictions and also discloses a descriptive and actionable version of the model. The actionable version enables the user to understand operation and if desired modify the model. Different fees may be charged for black-box and white-box models. Fees may be flat rate, per prediction, or based on other criteria.

Additional aspects and advantages of this invention will be apparent from the following detailed description of preferred embodiments, which proceeds with reference to the accompanying drawings. The invention is not intended to be limited by the drawings. Rather, the drawings merely illustrate examples of some embodiments of some aspects of this disclosure.

Some portions of the detailed descriptions which follow are presented in terms of procedures, logic blocks, processing, steps, and other symbolic representations of operations on data bits within a computer memory. These descriptions and representations are the means used by those skilled in the data processing arts to most effectively convey the substance of their work to others skilled in the art. A procedure, logic block, process, etc., is generally conceived to be a self-consistent sequence of steps or instructions leading to a desired result. The steps require physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared and otherwise manipulated in a computer system.

In short, the invention is intended to be implemented in software; i.e., in one or more computer programs, routines, functions or the like. Thus it may best be utilized on a machine such as a computer or other device that has at least one processor and access to memory, as further described later. In a preferred embodiment, a system is hosted on a server to provide the features described herein to remote users. The server may comprise one or more processors. The server may be remotely hosted “in the cloud.” Accordingly, in this description, we will sometimes use terms like “component,” “subsystem,” “model server,” “prediction server,” or the like, each of which preferably would be implemented in software.

It should be borne in mind that all of the above and similar terms are to be associated with the appropriate physical quantities they represent and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the following discussions, it is appreciated that throughout the present invention, discussions utilizing terms such as ‘processing,’ ‘computing,’ ‘calculating,’ ‘determining,’ ‘displaying’ or the like, refer to the action and processes of a computer system, or any other digital electronic computing device having a processor, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices. The term processor encompasses multiple processors acting in concert. A system may be located in one or more physical locations, e.g. distributed or housed “in the cloud” (a centralized location where plural processors and related equipment may be housed and remotely accessed). Further, we use the term “user” in a broad sense relative to a processor or other computing platform, program or service. A user may be a natural person or another processor or other computing platform, program or service. For example, an API may be used to enable another machine or processor to utilize a system of the type described herein.

Note that the invention can take the form of an entirely hardware embodiment, an entirely software/firmware embodiment or an embodiment containing both hardware and software/firmware elements. In a preferred embodiment, the invention is implemented in software, which includes but is not limited to firmware, resident software, microcode, etc. Furthermore, the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a simplified process and system hybrid diagram, illustrating one example of an implementation consistent with the present disclosure, in which a first user interacts with a system to create a new data source, a new dataset derived from the data source, a new model of the dataset, and a new prediction that utilizes the model of the dataset.

FIG. 2 shows the diagram of FIG. 1 amended to illustrate the first user publishing the dataset to a public gallery.

FIG. 3 shows the diagram of FIG. 2 amended to illustrate a second user acquiring the dataset, and the dataset cloned and copied to the second user's workspace or “dashboard.”

FIG. 4 shows the diagram of FIG. 3 amended to illustrate the second user utilizing the acquired clone dataset in the second user's private dashboard.

FIG. 5 shows the diagram of FIG. 1 amended to illustrate User #1 utilizing the system to make a model public in a black-box way.

FIG. 6 shows the diagram of FIG. 5 amended to illustrate a second User #2 utilizing the User #1 black-box model to make predictions in the user #2 private dashboard.

FIG. 7 shows the diagram of FIG. 1 amended to illustrate the first user publishing a white-box type of model to the public gallery.

FIG. 8 shows the diagram of FIG. 7 amended to illustrate the second user purchasing access to the white-box public model, and showing the white-box model cloned and copied to the user #2 private dashboard.

FIG. 9 shows the diagram of FIG. 8 amended to illustrate the second user utilizing the cloned model to make a new prediction in the user #2 private dashboard.

FIG. 10 shows an example of a source summary screen display.

FIG. 11 shows an example of a dataset summary screen display.

FIG. 12 shows an example of a graphical model screen display.

FIG. 13 shows an example of an interactive prediction user interface.

FIG. 14 shows a second example of a dataset summary screen display.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIG. 1 is a simplified process and system hybrid diagram, illustrating one example of an implementation consistent with the present disclosure. In the figure, a user #1 (reference 102) has access to her own private workspace or “dashboard” 100. In a preferred embodiment, a user may interact with the system using a website. Programs, data, etc., in the user #1 dashboard 100 are not visible to other users. Users may be registered with the system, and may have login names and passwords, or utilize other known security measures. A registered user 102 can create a new data source 110 by uploading local data or using a URL that a source server 112 may use to retrieve the data. The data may be stored in a data store accessible to the user 102 and to the source server 112. In one example, the user may provide access credentials to enable the source server to access selected data stored at 120. A data source or source is the raw data that is used to create a predictive model. A dataset can be a single dataset or a set of datasets. A dataset can be created using a number of transformations over one or more datasets.

A source is usually a (large) file, and may be in a comma-separated values (CSV) format, although many other formats may be used. Each column in the data file may represent a feature or field. A column, often the last column, usually represents the class or objective field. The file might have a first row, called the header, with a name for each field.
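
By way of illustration only, a source of the kind just described could be read as in the following minimal Python sketch. It assumes a CSV file whose first row is a header and whose last column is the objective field; the function name `read_source` and the sample records are hypothetical and are not part of the disclosed system.

```python
import csv
import io

def read_source(text):
    """Parse CSV text into (field_names, rows, objective_field).

    Assumes the first row is a header and the last column holds the
    objective field, which is common but not required.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    # Each record becomes a mapping from field name to raw string value.
    rows = [dict(zip(header, record)) for record in reader if record]
    return header, rows, header[-1]

# Hypothetical sample data in the style of the iris example.
sample = (
    "sepal length,sepal width,species\n"
    "5.1,3.5,Iris-setosa\n"
    "6.2,2.9,Iris-versicolor\n"
)
fields, rows, objective = read_source(sample)
```

Values are kept as strings at this stage; typing each field is left to the dataset step.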

A registered user 102 may transform her new source data 110 into a dataset 134 as follows. An illustrative system further includes a dataset server 130. The new data source 110 may provide raw data to the dataset server 130 as indicated by arrow 132. A dataset may comprise a binary and structured version of a source where each field has been processed and serialized according to its type. A field may be numeric, categorical, a date, time, or text. Other field types may be created. In an embodiment, for each field, descriptive statistics including histograms that depict the distribution may be computed automatically, depending on the type of the field. In general the new dataset 134 is derived from the data source 110. Note the new dataset 134 is associated with user 102's private dashboard 100 so that the dataset is accessible to user 102 but not available to anyone else at this juncture.
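
The per-field statistics mentioned above might be computed along the following lines. This is a simplified Python sketch, not the disclosed implementation: it assumes only two field types (numeric and categorical), treats empty strings as missing values, and uses equal-width bins; the name `summarize_field` and the `bins` parameter are hypothetical.

```python
from collections import Counter

def summarize_field(values, bins=4):
    """Return a small summary (type, counts, histogram) for one field.

    A field is treated as numeric if every non-empty value parses as a
    float; otherwise it is treated as categorical (simplified rule).
    """
    present = [v for v in values if v != ""]
    missing = len(values) - len(present)
    try:
        numbers = [float(v) for v in present]
    except ValueError:
        # Categorical: the histogram is a count per distinct value.
        return {"type": "categorical", "count": len(present),
                "missing": missing, "histogram": dict(Counter(present))}
    if not numbers:
        return {"type": "numeric", "count": 0,
                "missing": missing, "histogram": []}
    low, high = min(numbers), max(numbers)
    width = (high - low) / bins or 1.0  # avoid zero width for constant fields
    histogram = [0] * bins
    for x in numbers:
        # The maximum value falls into the last bin.
        index = min(int((x - low) / width), bins - 1)
        histogram[index] += 1
    return {"type": "numeric", "count": len(numbers), "missing": missing,
            "min": low, "max": high, "histogram": histogram}
```

A summary of this kind is what other users could see for a public dataset, without access to the raw records.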

The user 102 may transform the new dataset 134 into a model. A model is a structured representation of the dataset with predictive power. A model may be created using a subset of fields of the dataset as input fields, and another subset of the dataset fields as objective fields (the ones that the user wants the model to predict). A model can be a single model or an ensemble of models. An ensemble can be created using the same or different datasets or techniques. Details of methods for creating models are known. In the illustration of FIG. 1, the dataset 134 is provided to a model server 140 as indicated by arrow 142. The model server creates a new model 144 based on the dataset and selected parameters. For example, a model may use all the instances in the dataset to build the model, or just a subset as specified by the user through a number of sampling options.

Datasets and models may be enriched with various types of metadata, for example:

-   Descriptive metadata about the origin or purpose of the data, like descriptions, tags, categories, or pictures.
-   Technical metadata, for example number of rows, number of fields, types of fields, type of algorithm, and parameters of the algorithm.
-   Privacy metadata that defines who is authorized to get access, under what conditions, and what restrictions apply.

Note in the present example the new model 144 is associated with user 102's private dashboard 100 so that the model is accessible to user 102 but not available to anyone else at this juncture. The model can be used to make predictions on new input data that were not used to build the model. The model uses the new input data to predict values of the objective fields. Some examples are described later.

The user 102 may make predictions using a prediction server 150. The model 144 is accessible to the prediction server 150 as indicated by arrow 152. Further, the user provides new input data 154 to the prediction server for the purpose of predicting objective field values, based on applying the model to that new input data. The results form a new prediction 160. Note again that the model, the new input data, and the new prediction 160 are all associated exclusively with the user 102 secure dashboard 100. In an embodiment, each registered user may have a number of sources, datasets, models, and predictions that are kept private in her dashboard. In some embodiments, interactions between a user and the system can be done through a user interface or programmatically through an API.

A user 102 may make her dataset 134 public. Referring now to FIG. 2, it shows a public gallery 200. The public gallery may be implemented, for example, as a webpage on a website. Examples are shown below. Once a data set is made public, other users can have access to the summary statistics for each individual field in the data set, but cannot have access to the raw data that was used to build the data set. In some embodiments, the system can be exposed to the general public via a public gallery where heterogeneous users can exchange datasets and/or models. In some embodiments, the system may be accessible only to the members or a selected subset of members of the same organization. If the data set is free or once a user has paid the requested fee, the data set can be used within the system by that paying user (208) to create new models.

Preferably, unique identifiers are assigned to each resource in the system. For example, unique identifiers may be assigned to each source, dataset, model, ensemble, prediction, and evaluation. By way of illustration, identifiers may look like this:

-   source/5202c67b035d072c00006974
-   dataset/5202ce99035d072bf9002476
-   model/521e8c01035d0750c6000a2a
-   ensemble/521e8ce1035d0750c6000a3f
-   prediction/521e8b05035d0750cd00079d
-   evaluation/521e8d3e035d0750c6000a46

Any identifiers that are at least unique within the system may be used. In use, for example, a software call to create a new model may include an identifier of the specific dataset to be used to create the new model. Resource identifiers can be used by users to refer to various assets in their private dashboards or in a public gallery.
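
As a hedged illustration of how such identifiers might be handled in software, the Python sketch below validates an identifier and builds a request body for creating a model from a dataset. The 24-character hexadecimal form is inferred only from the examples above and is an assumption, as are the names `parse_resource_id` and `model_request` and the shape of the request body.

```python
import re

# Resource types named in the description.
RESOURCE_TYPES = {"source", "dataset", "model",
                  "ensemble", "prediction", "evaluation"}

# Assumed format: "<type>/<24 lowercase hex characters>".
_ID_PATTERN = re.compile(r"^([a-z]+)/([0-9a-f]{24})$")

def parse_resource_id(resource_id):
    """Split an identifier into (resource_type, key) or raise ValueError."""
    match = _ID_PATTERN.match(resource_id)
    if not match or match.group(1) not in RESOURCE_TYPES:
        raise ValueError(f"not a valid resource identifier: {resource_id!r}")
    return match.group(1), match.group(2)

def model_request(dataset_id):
    """Build a hypothetical request body for creating a model from a dataset."""
    kind, _ = parse_resource_id(dataset_id)
    if kind != "dataset":
        raise ValueError("a model must be created from a dataset")
    return {"dataset": dataset_id}
```

For example, `model_request("dataset/5202ce99035d072bf9002476")` yields a body that names the dataset to be modeled, while passing a source identifier is rejected.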

Referring again to FIG. 2, a data set 202 has been made public by its owner, here user 102, and placed in the public gallery 200. The dataset may be placed in the public gallery by the dataset server 130. A second user 208 (or any user) now has access to the public data set 202. The owner, in this example user 102, can decide whether other users can have access to the data set for free or need to pay a fee before using the data set. Such fees may be collected, for example in the context of a web site, by way of bank transfers, credit cards, payment services such as PayPal, etc.

Referring now to FIG. 3, this figure illustrates a data set, namely public data set 204, placed in the public gallery 200 by the data set server 130. A second user 202 pays consideration to access or purchase the data set 204. In this case, the data set is cloned to create a new private data set 210. Private data set 210 is placed in user 202's private dashboard 212. The private dataset can be used by user 202 to create a new model using the model server 140. FIG. 4 illustrates this scenario, in which the user 202 has acquired the private data set 210. User 202 then utilizes the model server 140 to create a new model of her private data set 210. The new model 220 is stored in the user's private dashboard 212. Further, in FIG. 4, the second user 202 may apply the new model 220 to the prediction server 150 in order to create a new prediction 240. The new prediction may be based on new input data 230, provided by the user 202 to the prediction server.
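
The acquire-and-clone flow of FIGS. 3 and 4 can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the disclosed implementation: the classes `PublicDataset` and `Dashboard`, the function `acquire`, and the `owner_share` parameter are all hypothetical, and payment handling is reduced to comparing an amount against a price.

```python
import copy
from dataclasses import dataclass, field

@dataclass
class PublicDataset:
    owner: str
    summary: dict        # field statistics shown in the gallery
    data: list           # records held by the system, not shown in the gallery
    price: float = 0.0   # zero means the dataset is free

@dataclass
class Dashboard:
    user: str
    datasets: list = field(default_factory=list)
    balance: float = 0.0

def acquire(dataset, buyer, owner_dashboard, amount_paid, owner_share=1.0):
    """Clone a public dataset into the buyer's dashboard once the fee is paid."""
    if amount_paid < dataset.price:
        raise PermissionError("requested fee has not been paid")
    # Credit at least a portion of the fee to the owner's account.
    owner_dashboard.balance += dataset.price * owner_share
    # The buyer receives an independent private copy.
    clone = copy.deepcopy(dataset.data)
    buyer.datasets.append(clone)
    return clone
```

Because the clone is a deep copy, later changes in the buyer's dashboard do not affect the published dataset.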

Evaluations provide a convenient way to measure the performance of a predictive model. A user can evaluate a model using a dataset of her own. In some embodiments, the server may provide various performance metrics that estimate how well a given model can be expected to predict an outcome when faced with data similar to the dataset tested.
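
The description does not say which performance metrics are provided. As one possible example, the Python sketch below computes classification accuracy, the fraction of instances for which the model's prediction matches the objective field. The function name `accuracy` and the representation of a model as a callable are assumptions made for illustration.

```python
def accuracy(model, dataset, objective):
    """Fraction of instances for which `model` predicts the objective value.

    `model` is any callable that maps an instance (a dict of field
    values) to a predicted value; `dataset` is a list of such dicts.
    """
    if not dataset:
        raise ValueError("cannot evaluate on an empty dataset")
    correct = sum(1 for row in dataset if model(row) == row[objective])
    return correct / len(dataset)
```

The same function works for a user's own model or for a model acquired from the gallery, provided the evaluation dataset contains the objective field.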

In an embodiment, a user also can make a model public that she has built using a system of the type illustrated herein. A model may be uploaded, for example, into a user's private dashboard from an external source. Or, the user may choose to make public a model that she created using the system, as described above (new model 144). Once a model is made public, other users may have access to a thumbnail picture, for example, that represents the model, and/or other meta information about the model. Metadata may include, for example, the number of fields and number of instances of the dataset used to build the model.

The owner of a model can control whether other users can have access to the internal structure of the model, for example using a selected one of at least two methods: black-box or white-box. If the owner selects the black-box method, other users will not be able to see how the model works internally, but they will still be able to use the model to make predictions using their own input data. If the owner uses the white-box method, other users will have access to a descriptive and actionable version of the model that explains how it works, in addition to the ability to make predictions as conferred by black-box models. For example, an actionable version of a model may be provided in various formats. These may include JSON, PMML, Microsoft Excel, a sequence of if-then rules and various programming languages such as Python, Ruby, Java, etc.
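
The difference between the two methods can be sketched with a small decision tree. In the Python example below, which is illustrative only, a black-box publication exposes a prediction function and nothing else, while a white-box publication additionally exposes the tree as a sequence of if-then rules. The tree structure, the threshold values, and the names `predict`, `to_rules` and `publish` are hypothetical and are not taken from the figures.

```python
# A toy decision tree as nested dictionaries (JSON-compatible).
# Thresholds are illustrative values only.
TREE = {
    "field": "petal length", "threshold": 2.5,
    "low": {"predict": "Iris-setosa"},
    "high": {
        "field": "petal width", "threshold": 1.7,
        "low": {"predict": "Iris-versicolor"},
        "high": {"predict": "Iris-virginica"},
    },
}

def predict(node, inputs):
    """Walk the tree from the root until a leaf is reached."""
    while "predict" not in node:
        branch = "low" if inputs[node["field"]] <= node["threshold"] else "high"
        node = node[branch]
    return node["predict"]

def to_rules(node, conditions=()):
    """Export the tree as a list of if-then rules, one per leaf."""
    if "predict" in node:
        premise = " AND ".join(conditions) or "TRUE"
        return [f"IF {premise} THEN {node['predict']}"]
    name, limit = node["field"], node["threshold"]
    return (to_rules(node["low"], conditions + (f"{name} <= {limit}",))
            + to_rules(node["high"], conditions + (f"{name} > {limit}",)))

def publish(tree, white_box):
    """Return what other users may access for a published model."""
    view = {"predict": lambda inputs: predict(tree, inputs)}
    if white_box:
        view["rules"] = to_rules(tree)  # the descriptive, actionable version
    return view
```

With `publish(TREE, white_box=False)` a second user can call `predict` on her own input data but cannot inspect the rules; with `white_box=True` the rules are available as well.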

FIG. 5 illustrates a scenario in which the owner has published a black-box public model 300 into a public gallery 200. As shown, the model 300 may be placed in the public gallery by the model server 140. The gallery display may include information about the model such as the number of fields and instances in the dataset.

FIG. 6 illustrates the use of the black-box public model 300 by a user 208. In this illustration, the user 208 pays the consideration requested for use of the model, and then uses the black-box model 300 and the prediction server 150 in order to create a new prediction 302. The new prediction may be based on new input data 304 provided by user 208. The new prediction 302 is proprietary to the user 208 and thus is placed in the user's private dashboard 212.

FIG. 7 illustrates a case where a white-box public model 400 has been published to the public gallery 200. Referring now to FIG. 8, it illustrates the user 208 paying consideration to purchase the white-box public model 400. In that case, the model 400 is cloned and a new model 500 is formed in the user's private dashboard 212.

Turning now to FIG. 9, continuing the prior example, the user 208 can now use her new model 500 and create a new prediction by using the prediction server 150 to apply new input data (not shown) to the model 500. The resulting new prediction 502 is written to the user's private dashboard 212.

Once a user makes a data set or model public, the corresponding data set or model may be exposed both in an individual gallery or dashboard and in a public gallery, such as the public gallery 200. In the public gallery, data sets and models can be filtered by an interested user according to a variety of variables such as popularity, recency, white-box, black-box, price and other attributes including but not limited to names, descriptions, tags, etc. In a preferred embodiment, the public gallery may be implemented on a webpage.
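
Filtering of gallery listings might be implemented along the following lines. This Python sketch is an assumption-laden illustration: the listing fields (`kind`, `box`, `price`, `tags`, `popularity`, `published`), the sample listings, and the function `filter_gallery` are hypothetical.

```python
def filter_gallery(listings, kind=None, box=None, max_price=None,
                   tag=None, sort_by="popularity"):
    """Select listings matching every given criterion, highest `sort_by` first."""
    selected = [
        item for item in listings
        if (kind is None or item["kind"] == kind)
        and (box is None or item.get("box") == box)
        and (max_price is None or item["price"] <= max_price)
        and (tag is None or tag in item.get("tags", ()))
    ]
    return sorted(selected, key=lambda item: item[sort_by], reverse=True)

# Hypothetical listings; dates are ISO strings so they sort by recency.
listings = [
    {"name": "iris model", "kind": "model", "box": "white", "price": 0.0,
     "tags": ["flowers"], "popularity": 12, "published": "2013-08-01"},
    {"name": "titanic model", "kind": "model", "box": "black", "price": 2.0,
     "tags": ["history"], "popularity": 30, "published": "2013-09-01"},
    {"name": "iris dataset", "kind": "dataset", "price": 0.0,
     "tags": ["flowers"], "popularity": 7, "published": "2013-07-15"},
]

free_items = filter_gallery(listings, max_price=0.0)
```

Passing `sort_by="published"` orders the same selection by recency instead of popularity.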

FIG. 10 shows an example of a source summary screen display. For example, a display of this general type may be used in the context of a public gallery on a web site to describe a data source. In this case, the data relates to iris flowers. The raw data comprises various instances of input fields (sepal length, sepal width, etc.) and a species objective or output field. Values for some instances may be displayed. The field types may be indicated as well (for example, “123” indicating a numeric type, “ABC” indicating a category type). The design of this display is merely illustrative; the layout is not critical.

FIG. 11 shows an example of a dataset summary screen display. In this example, the dataset is derived from the source of FIG. 10. A display of this type may include counts for each field, and a histogram of values for each field. A display of this type, for example, may be used in the context of a public gallery on a web site to describe a dataset. The design of this display is merely illustrative; the layout is not critical.

FIG. 12 shows an example of a decision tree model visualization. This graphic illustrates a model of the dataset of FIG. 11. For example, branch line widths may be used to indicate the number of instances of sample data. Each end leaf may correspond to a species of iris, that is, the prediction based on the input variables indicated in the dataset. Other types of visualizations of models may be implemented.

FIG. 13 shows an example of an interactive prediction user interface. In this illustration, sliders are used for a user to input values of the numeric variables, in order to predict a species of iris using the model of FIG. 12.

FIG. 14 is another example of a dataset summary screen display. In this case the dataset reflects survivors of the famous sinking of the RMS Titanic (1912). The field names include name, age, class/department, fare, etc. The second column indicates the type of each field (for example, types of data may include text, numeric, binary, etc.). For each field, subsequent columns may indicate the respective count of items, number missing, and number of errors. These columns are merely illustrative and not critical. On the right, a histogram indicates the corresponding distribution of values for each field. For example, the first one is an age distribution. The next histogram suggests there were seven different classes of passengers.

A person of ordinary skill in the art will recognize that they may make many changes to the details of the above-described exemplary systems and methods without departing from the underlying principles. Only the following claims, therefore, define the scope of the exemplary systems and methods.

We claim:
 1. A processor-implemented system comprising: (a) a source server for managing access to data; (b) a dataset server for creating and managing access to a dataset created from a data source; (c) a model server for creating and managing access to a model based on the dataset; (d) a prediction server for creating and managing access to a prediction that results from utilizing the model; (e) a user interface component that implements a corresponding private dashboard for each one of plural users of the system; and (f) a public gallery component that, in cooperation with the user interface, implements a public gallery to enable a first user to selectively expose a dataset for use by other users of the system.
 2. The system according to claim 1 wherein the public gallery component, in cooperation with the user interface, further enables a user to selectively expose a model for use by other users of the system.
 3. The system according to claim 2 wherein the system further enables the first user to selectively expose either a white-box public model or a black-box public model for use by other users of the system.
 4. The system according to claim 1 wherein the [dataset server] system is arranged to clone the dataset exposed in the public gallery to form a copy, and to provide the cloned copy as a private dataset for use by a second user, responsive to an indication of receipt of consideration from the second user.
 5. The system according to claim 4 wherein the system is arranged to credit at least a portion of the consideration to an account of the first user.
 6. The system according to claim 5 wherein the system further enables the second user to provide its private dataset to the model server to create a new model accessible in the second user's private dashboard.
 7. The system according to claim 6 wherein the system further enables the second user to utilize its new model to create a new prediction accessible in the second user's private dashboard.
 8. The system according to claim 6 wherein the first user elects to charge for each new prediction made by the second user, and the system responsively requires receipt of consideration from the second user for each additional new prediction initiated by the second user.
 9. A processor-implemented system comprising: (a) a source server for managing access to data; (b) a dataset server for creating and managing access to a dataset created from a data source; (c) a model server for creating and managing access to a model based on the dataset; (d) a prediction server for creating a prediction from input data by utilizing the model; (e) a user interface component that implements a corresponding private dashboard for at least one of plural users of the system; and (f) a public gallery component that implements a public gallery and enables a user to selectively expose a dataset in the public gallery for use by other users of the system.
 10. The system according to claim 9 wherein the system further enables the first user to offer access to the exposed dataset to a second user in exchange for a predetermined consideration.
 11. The system according to claim 10 wherein the consideration is monetary.
 12. The system according to claim 9 wherein the public gallery component further enables a first user to selectively expose a model owned by the first user for potential use by other users of the system.
 13. The system according to claim 10 wherein the public gallery component enables the first user to selectively expose either a white-box public model or a black-box public model for potential use by other users of the system.
 14. The system according to claim 11 wherein the system is arranged to clone the dataset exposed in the public gallery to form a copy, and to provide the cloned copy as a private dataset for use by a second user, responsive to an indication of receipt of consideration from the second user.
 15. A non-transitory, machine readable storage medium having stored thereon a series of instructions for causing one or more processors to perform operations comprising: accessing a data source to acquire raw data comprising a plurality of records of plural fields; processing the acquired raw data to form a corresponding dataset; responsive to input from a first user who owns the dataset, publishing the dataset to a public gallery; collecting a fee from a second user to utilize the published dataset; and cloning the dataset to form a copy for use by the second user.
 16. The non-transitory, machine readable storage medium according to claim 15, the stored instructions further causing the one or more processors to perform operations comprising: creating a model of a dataset; responsive to input from a first user who owns the model, publishing the model to a public gallery; and collecting a fee from a second user to utilize the published model to make a new prediction.
 17. The non-transitory, machine readable storage medium according to claim 16 wherein the new prediction is based on applying the published model to new data provided by the second user.